Univariate Plots Section

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar 
##  Min.   : 4.600   Min.   :0.1200   Min.   :0.0000   Min.   :0.900  
##  1st Qu.: 7.100   1st Qu.:0.3950   1st Qu.:0.0900   1st Qu.:1.900  
##  Median : 7.900   Median :0.5200   Median :0.2500   Median :2.200  
##  Mean   : 8.259   Mean   :0.5288   Mean   :0.2661   Mean   :2.409  
##  3rd Qu.: 9.100   3rd Qu.:0.6400   3rd Qu.:0.4200   3rd Qu.:2.600  
##  Max.   :13.200   Max.   :1.5800   Max.   :1.0000   Max.   :8.300  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 21.25      
##  Median :0.07900   Median :13.00       Median : 37.00      
##  Mean   :0.08699   Mean   :15.17       Mean   : 44.52      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 60.00      
##  Max.   :0.61100   Max.   :46.00       Max.   :144.00      
##     density             pH          sulphates         alcohol      quality
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40   3: 10  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   4: 52  
##  Median :0.9967   Median :3.310   Median :0.6200   Median :10.20   5:650  
##  Mean   :0.9967   Mean   :3.316   Mean   :0.6569   Mean   :10.43   6:614  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7275   3rd Qu.:11.10   7:190  
##  Max.   :1.0029   Max.   :4.010   Max.   :2.0000   Max.   :14.00   8: 18
## 'data.frame':    1534 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...

The following histograms show how the values of each variable are distributed:

These histograms give a general idea of the dataset. For example, the quality histogram shows that 5 and 6 are the most common ratings, with 5 slightly ahead (681 wines against 638 in the counts above). In addition, 8 is the highest rating in the dataset.

To reduce skew and bring some of these distributions closer to normal, we apply a base-10 logarithmic transformation. Total sulfur dioxide, fixed acidity and sulphates have long-tailed distributions, so to make them easier to read on the graph we do the following:

##      nbr.val     nbr.null       nbr.na          min          max 
## 1.534000e+03 0.000000e+00 0.000000e+00 3.300000e-01 2.000000e+00 
##        range          sum       median         mean      SE.mean 
## 1.670000e+00 1.007700e+03 6.200000e-01 6.569100e-01 4.339978e-03 
## CI.mean.0.95          var      std.dev     coef.var 
## 8.512921e-03 2.889351e-02 1.699809e-01 2.587583e-01
##       nbr.val      nbr.null        nbr.na           min           max 
##  1.534000e+03  1.000000e+00  0.000000e+00 -4.814861e-01  3.010300e-01 
##         range           sum        median          mean       SE.mean 
##  7.825161e-01 -2.979098e+02 -2.076083e-01 -1.942046e-01  2.476082e-03 
##  CI.mean.0.95           var       std.dev      coef.var 
##  4.856866e-03  9.404925e-03  9.697899e-02 -4.993651e-01

The long-tailed sulphates data were log10-transformed, which produces a roughly normal distribution. The variance drops from 0.0289 to 0.0094 on the log10 scale, and the histogram looks more normal.
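The effect of the transformation can be sketched in base R. The snippet below is only an illustration: it uses simulated right-skewed values rather than the wine data, and `skewness()` is a hypothetical helper written for this example, not a function from the original analysis.

```r
# Simulated long-tailed values standing in for a column such as sulphates
# (illustrative data only, not the wine dataset).
set.seed(42)
x <- rlnorm(1000, meanlog = -0.5, sdlog = 0.4)

# Hypothetical helper: sample skewness (third standardized moment).
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^1.5
}

log_x <- log10(x)

# Compare spread and shape on the raw and log10 scales.
comparison <- data.frame(
  scale    = c("raw", "log10"),
  variance = c(var(x), var(log_x)),
  skewness = c(skewness(x), skewness(log_x))
)
print(comparison)

# Histogram bins on the log10 scale (plot = FALSE returns the counts only).
h <- hist(log_x, breaks = 30, plot = FALSE)
```

Because the raw and log10 values are on different scales, their variances are not directly comparable; a skewness closer to zero is the more telling sign that the transformed distribution is more symmetric.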

##      nbr.val     nbr.null       nbr.na          min          max 
## 1.534000e+03 0.000000e+00 0.000000e+00 4.600000e+00 1.320000e+01 
##        range          sum       median         mean      SE.mean 
## 8.600000e+00 1.266870e+04 7.900000e+00 8.258605e+00 4.167883e-02 
## CI.mean.0.95          var      std.dev     coef.var 
## 8.175356e-02 2.664750e+00 1.632406e+00 1.976612e-01
##      nbr.val     nbr.null       nbr.na          min          max 
## 1.534000e+03 0.000000e+00 0.000000e+00 6.627578e-01 1.120574e+00 
##        range          sum       median         mean      SE.mean 
## 4.578161e-01 1.394122e+03 8.976271e-01 9.088150e-01 2.124008e-03 
## CI.mean.0.95          var      std.dev     coef.var 
## 4.166269e-03 6.920504e-03 8.318956e-02 9.153630e-02

Fixed acidity also appears to be long-tailed, and taking its log10 brings it closer to a normal distribution. The variance drops from 2.66 to 0.0069.

##      nbr.val     nbr.null       nbr.na          min          max 
## 1.534000e+03 0.000000e+00 0.000000e+00 1.200000e-01 1.580000e+00 
##        range          sum       median         mean      SE.mean 
## 1.460000e+00 8.111800e+02 5.200000e-01 5.288005e-01 4.533375e-03 
## CI.mean.0.95          var      std.dev     coef.var 
## 8.892273e-03 3.152599e-02 1.775556e-01 3.357705e-01
##       nbr.val      nbr.null        nbr.na           min           max 
##  1.534000e+03  3.000000e+00  0.000000e+00 -9.208188e-01  1.986571e-01 
##         range           sum        median          mean       SE.mean 
##  1.119476e+00 -4.633255e+02 -2.839967e-01 -3.020375e-01  3.882498e-03 
##  CI.mean.0.95           var       std.dev      coef.var 
##  7.615568e-03  2.312319e-02  1.520631e-01 -5.034577e-01

Volatile acidity appears to be long-tailed as well, and taking its log10 brings it closer to a normal distribution, as with the variables above. Since pH is a logarithmic measure and is roughly normal, it makes sense for the log of the acidity levels to be approximately normal too. The variance decreases here as well, from 0.0315 to 0.0231, though less markedly.

##      nbr.val     nbr.null       nbr.na          min          max 
## 1.534000e+03 0.000000e+00 0.000000e+00 6.000000e+00 1.440000e+02 
##        range          sum       median         mean      SE.mean 
## 1.380000e+02 6.829000e+04 3.700000e+01 4.451760e+01 7.624117e-01 
## CI.mean.0.95          var      std.dev     coef.var 
## 1.495480e+00 8.916706e+02 2.986085e+01 6.707651e-01
##      nbr.val     nbr.null       nbr.na          min          max 
## 1.534000e+03 0.000000e+00 0.000000e+00 7.781513e-01 2.158362e+00 
##        range          sum       median         mean      SE.mean 
## 1.380211e+00 2.379277e+03 1.568202e+00 1.551028e+00 7.633760e-03 
## CI.mean.0.95          var      std.dev     coef.var 
## 1.497372e-02 8.939277e-02 2.989862e-01 1.927665e-01

The long-tailed total sulfur dioxide data were log10-transformed as well, which produces a roughly normal distribution. The variance drops sharply, from 891.67 to 0.0894, and the distribution appears nearly normal.

What about quality?

As noted earlier, most ratings are 5 or 6. To make the histogram more informative, we can group the ratings into categories as follows:

##       bad   average excellent 
##        62      1264       208
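One way to build such a grouping is with `cut()`. The sketch below is an assumption about how the categories could be derived, not necessarily the code used in the report; the `quality` vector is rebuilt from the per-rating counts in the factor summary above, and the break points follow the category definitions given later in the analysis.

```r
# Rebuild the quality column from the per-rating counts shown in the
# summary of the filtered data (ratings 3 to 8).
quality <- rep(3:8, times = c(10, 52, 650, 614, 190, 18))

# Bin into three ordered categories; with the default right = TRUE the
# intervals are (0,4], (4,6] and (6,10].
rating <- cut(
  quality,
  breaks = c(0, 4, 6, 10),
  labels = c("bad", "average", "excellent"),
  ordered_result = TRUE
)

table(rating)
```

`cut()` needs a numeric input, so if `quality` has already been converted to an ordered factor it would first have to be turned back into numbers, for example with `as.integer(as.character(quality))`.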

Univariate Analysis

What is the structure of your dataset?

There are 1599 observations in total and 1534 observations after removing the top 1% from the variables that had large outliers.

What is/are the main feature(s) of interest in your dataset?

Quality and alcohol are the main features.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I expect residual sugar and alcohol to play a major role in wine quality and taste.

Did you create any new variables from existing variables in the dataset?

Yes, a three-level rating variable derived from quality: bad (0 < quality < 5), average (5 <= quality < 7), and excellent (quality of 7 or 8).

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

  • Performed a logarithmic transformation on the following features: (1) sulphates, (2) fixed acidity, (3) total/free sulfur dioxide.

  • Removed the top 1% of values of some features: fixed acidity, residual sugar, total sulfur dioxide, and free sulfur dioxide.

  • The first column was removed because it was an index for the observations.
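The outlier removal described above could be sketched as follows. This is a minimal illustration under stated assumptions: the data frame is simulated, `trim_top_1pct()` is a hypothetical helper, and keeping values at or below the 99th percentile is one reasonable reading of "removing the top 1%" rather than the report's confirmed rule.

```r
# Toy data frame standing in for the wine data (illustrative values only).
set.seed(1)
wine <- data.frame(
  fixed.acidity  = rlnorm(500, meanlog = 2.1, sdlog = 0.2),
  residual.sugar = rlnorm(500, meanlog = 0.9, sdlog = 0.4)
)

# Hypothetical helper: keep rows that are at or below the 99th percentile
# of every listed column.
trim_top_1pct <- function(df, cols) {
  keep <- rep(TRUE, nrow(df))
  for (col in cols) {
    keep <- keep & df[[col]] <= quantile(df[[col]], 0.99)
  }
  df[keep, , drop = FALSE]
}

trimmed <- trim_top_1pct(wine, c("fixed.acidity", "residual.sugar"))
nrow(wine) - nrow(trimmed)  # number of rows dropped
```

Each cutoff is computed on the full data before any rows are removed, so the result does not depend on the order in which the columns are listed.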

Bivariate Plots Section

Now we'll create a correlation matrix to see the correlations between pairs of variables.

First, the relationship between fixed acidity and pH, where the correlation coefficient is about -0.68:

## [1] -0.6794406

Next, the correlation between citric acid and pH (about -0.53):

## [1] -0.5283267

The correlation coefficient between volatile acidity and pH is about 0.24:

## [1] 0.2387919

The correlation coefficient between volatile acidity and citric acid is -0.56

## [1] -0.5629224

The correlation coefficient between alcohol and pH is about 0.22:

## [1] 0.2166557
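Coefficients like the ones above are Pearson correlations, which base R computes with `cor()`. The sketch below uses only the first ten rows of two columns, copied from the `str()` output earlier in the report, so it will not reproduce the full-data values; pairing fixed acidity with pH here is simply an illustrative choice.

```r
# First ten rows of two columns, copied from the str() output above.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
ph            <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)

# Pearson correlation via the built-in function.
r_builtin <- cor(fixed_acidity, ph)

# The same quantity from its definition: covariance divided by the
# product of the two standard deviations.
r_manual <- cov(fixed_acidity, ph) / (sd(fixed_acidity) * sd(ph))

# cor.test() adds a confidence interval and a p-value.
ct <- cor.test(fixed_acidity, ph)

c(builtin = r_builtin, manual = r_manual)
```

With so few rows the confidence interval in `ct$conf.int` is wide, which is a reminder that a coefficient is best read together with the sample size behind it.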

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Some interesting points:

  • Lower pH indicates higher acidity.

  • As citric acid increases, sulphates tend to increase as well.

  • The correlation between volatile acidity and citric acid is negative.

  • The correlation between citric acid and pH is negative.

  • pH and alcohol are very weakly correlated.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Volatile acidity and citric acid were negatively correlated, as were citric acid and pH. Fixed acidity and pH were also negatively correlated, which is expected because more acid means a lower pH.

What was the strongest relationship you found?

The correlation between acidity and pH (-0.679) was the strongest in magnitude, followed by citric acid and volatile acidity (-0.563).

Multivariate Plots Section

Next, we use scatterplots to look at the relationship between pairs of variables across quality levels, starting with citric acid and alcohol.

Citric acid and Alcohol

Now with Sulphates and Alcohol

Most bad wines seem to have higher levels of volatile acidity, and most excellent wines have lower levels.

## redWine$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2300  0.5800  0.6800  0.7306  0.8838  1.5800 
## -------------------------------------------------------- 
## redWine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.4100  0.5400  0.5386  0.6400  1.3300 
## -------------------------------------------------------- 
## redWine$rating: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3100  0.3700  0.4090  0.4925  0.9150
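Grouped summaries in this layout are what `by()` prints when `summary` is applied within each level of a factor. The sketch below reuses the `redWine$rating` and `volatile.acidity` names from the output above, but the values are made up for illustration and are not taken from the wine data.

```r
# Toy data (illustrative values only, not the wine dataset).
redWine <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.66, 0.60, 0.35, 0.31, 0.40, 0.52),
  rating = factor(
    c("bad", "bad", "bad", "excellent", "average", "average",
      "excellent", "excellent", "average", "average"),
    levels = c("bad", "average", "excellent")
  )
)

# Six-number summary of volatile acidity within each rating level.
by(redWine$volatile.acidity, redWine$rating, summary)

# Compact alternative: a single median per group.
medians <- tapply(redWine$volatile.acidity, redWine$rating, median)
medians
```

`tapply()` returns a plain named vector, which is easier to reuse in later calculations than the printed `by()` object.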

Plotting volatile acidity against sulphates makes it clear that excellent wines have lower volatile acidity and higher sulphates content, while bad wines have higher volatile acidity and lower sulphates content.

The next plot is interesting. The probability of a wine being excellent is zero when volatile acidity is greater than 1. When volatile acidity is either 0 or 0.3, there is roughly a 40% probability that the wine is excellent. When volatile acidity is between 1 and 1.2, there is an 80% chance that the wine is bad. Any wine with a volatile acidity greater than 1.4 has a 100% chance of being bad.
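Probabilities of this kind can be approximated by binning volatile acidity and computing the share of each rating within every bin. The sketch below is an assumption about how to tabulate them, not the code behind the plot, and it runs on simulated data, so its numbers will differ from those quoted above.

```r
# Simulated data (illustrative only): volatile acidity tends to be lower
# for better ratings.
set.seed(7)
rating_levels <- c("bad", "average", "excellent")
rating <- factor(
  sample(rating_levels, 600, replace = TRUE, prob = c(0.1, 0.7, 0.2)),
  levels = rating_levels
)
centre <- c(bad = 0.75, average = 0.55, excellent = 0.40)
volatile_acidity <- pmax(
  0.1,
  rnorm(600, mean = centre[as.character(rating)], sd = 0.15)
)

# Bin volatile acidity and drop bins that contain no observations.
bins <- droplevels(cut(volatile_acidity, breaks = seq(0, 1.6, by = 0.2)))

# Share of each rating within every bin (each row sums to 1).
shares <- prop.table(table(bins, rating), margin = 1)
round(shares, 2)
```

Base R's `cdplot(rating ~ volatile_acidity)` draws a smoothed version of the same idea as a conditional density plot.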

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Some interesting points:

  • Most bad wines seem to have higher levels of volatile acidity, while most excellent wines have lower levels.

  • Excellent wines have lower volatile acidity and higher sulphates content.

  • Bad wines have higher volatile acidity and lower sulphates content.

Final Plots and Summary

Plot One: pH and Quality

## redWine$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.60   10.00   10.20   10.97   13.10 
## -------------------------------------------------------- 
## redWine$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.50   10.00   10.26   10.90   14.00 
## -------------------------------------------------------- 
## redWine$rating: excellent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.50   10.80   11.60   11.54   12.22   14.00

Description One

This graph shows the relationship between pH and quality: quality tends to increase as pH decreases and to decrease as pH increases.

Plot Two: Volatile Acidity vs Quality

Description Two

The probability of a wine being excellent is zero when volatile acidity is greater than 1. When volatile acidity is either 0 or 0.3, there is roughly a 40% probability that the wine is excellent. When volatile acidity is between 1 and 1.2, there is an 80% chance that the wine is bad. Any wine with a volatile acidity greater than 1.4 has a 100% chance of being bad.

Plot Three: Alcohol & Sulphates vs. Quality

Description Three

Bad wines have lower sulphates, with alcohol levels varying between 9% and 12%. Average wines have higher concentrations of sulphates. Wines rated 6 tend to have higher alcohol content and higher sulphates content.

This graph makes it fairly clear that both sulphates and alcohol content contribute to quality.

Reflection

The data set contains information on 1,599 red wines. Because it has many chemical variables, I assumed that some of them would be related to each other, and this turned out to be true: for example, pH was negatively correlated with citric acid, which makes sense. Alcohol level also appeared to be the most important variable for identifying high-quality wine. Volatile acidity in large amounts made a wine bad regardless of the other variables, which makes sense because large amounts of acetic acid create an unpleasant, vinegar-like taste.

One weakness of this data is possible bias in the wine tasters' preferences. Expert tasters tend to look for subtler qualities in a wine than an ordinary drinker would.

Struggles / Successes

The best part of this project, and for me the main success, was exploring and to some extent predicting wine quality from a few technical variables without actually tasting the wine. Just by exploring the data, anyone can identify the basic trends. I struggled with my lack of experience with the components of wine and what they mean, and with choosing the most appropriate graph for each context during the analysis.

In future work, expert reviews could be added to improve the dataset. Feedback from reviewers explaining how they rate a wine could add value to the analysis.